Named Entity Recognition and Disambiguation in Tweets Master Thesis

Authors

  • Soumya Ranjan Patra
  • Mykola Pechenizkiy
  • Erik Tromp
  • Alexander Serebrenik
Abstract

Social media has grown exponentially over the past few years, and users are generating far more unstructured content than ever before. Successful companies are also very active on social media, analysing these data for their marketing campaigns. However, the informal and noisy nature of such data makes it difficult to extract meaningful information from them. In this thesis, we investigate the problem of named entity recognition and disambiguation in tweets. Given a tweet written in English, the task is to extract all the named entities it contains and to assign each entity a reference link. This has potential applications in online product monitoring, sentiment analysis, electoral predictions and other social media analytics. We present a five-step approach to this problem consisting of tokenization, part-of-speech tagging, normalization, mention extraction and disambiguation. The first three steps serve as preprocessing for mention extraction and disambiguation, which are the core components of our approach. For tokenization and part-of-speech tagging, we use the regular expression tokenizer and the tagger developed by Owoputi et al. In the third step, we propose a normalization algorithm that uses Brown clusters and the Microsoft Web N-gram language model to normalize the tweets. In the next step, our mention extractor extracts candidate entities with the help of POS tags and n-gram matching against Freebase. Finally, using the intra-tweet context and the relationships among entities, our disambiguation algorithm assigns appropriate reference links to the extracted entities. To evaluate our approach, we carry out several experiments and benchmark our complete process against different state-of-the-art extraction and disambiguation systems. We also evaluate the importance of each individual step by measuring the overall performance degradation when that step is left out of the pipeline. Finally, we present a case study demonstrating how our approach can be applied in a real-world scenario.
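To make the last two steps of the pipeline more concrete, the following is a minimal sketch of n-gram mention extraction followed by context-based disambiguation. It is not the thesis implementation: `TOY_KB` is a hypothetical in-memory stand-in for Freebase with invented entity ids and keywords, the tokenizer is a plain whitespace splitter rather than the Owoputi et al. tokenizer, the POS-tag filter and the normalization step are omitted, and keyword overlap is only one simple way to use intra-tweet context.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy knowledge base: surface form -> {entity id: context keywords}.
# A stand-in for Freebase; ids and keywords are invented for illustration.
TOY_KB: Dict[str, Dict[str, Set[str]]] = {
    "apple": {
        "kb:apple_inc": {"iphone", "mac", "company"},
        "kb:apple_fruit": {"pie", "juice", "eat"},
    },
    "new york": {
        "kb:new_york_city": {"city", "manhattan", "subway"},
    },
}


def tokenize(tweet: str) -> List[str]:
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = []
    for raw in tweet.lower().split():
        tok = raw.strip(".,!?:;\"'()")
        if tok:
            tokens.append(tok)
    return tokens


def extract_mentions(tokens: List[str], max_n: int = 3) -> List[Tuple[int, int, str]]:
    """Greedy longest-match n-gram lookup against the toy knowledge base."""
    mentions = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest n-gram first so "new york" wins over "new".
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + n])
            if surface in TOY_KB:
                mentions.append((i, i + n, surface))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return mentions


def disambiguate(surface: str, context_tokens: List[str]) -> str:
    """Pick the candidate whose keywords overlap most with the rest of the tweet."""
    candidates = TOY_KB[surface]
    context = set(context_tokens)
    # sorted() makes tie-breaking deterministic (alphabetical by entity id).
    return max(sorted(candidates), key=lambda eid: len(candidates[eid] & context))


def link_entities(tweet: str) -> List[Tuple[str, str]]:
    """Run tokenization, mention extraction and disambiguation on one tweet."""
    tokens = tokenize(tweet)
    results = []
    for start, end, surface in extract_mentions(tokens):
        context = tokens[:start] + tokens[end:]
        results.append((surface, disambiguate(surface, context)))
    return results


if __name__ == "__main__":
    print(link_entities("Just bought a new iPhone from Apple in New York!"))
    print(link_entities("I want to eat apple pie"))
```

In this sketch the same surface form "apple" resolves to different entries depending on the other words in the tweet, which is the intuition behind using intra-tweet context. A real system would need a far larger knowledge base and a more robust scoring function.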


Similar resources

NEED4Tweet: A Twitterbot for Tweets Named Entity Extraction and Disambiguation

In this demo paper, we present NEED4Tweet, a Twitterbot for named entity extraction (NEE) and disambiguation (NED) for Tweets. The straightforward application of state-of-the-art extraction and disambiguation approaches on informal text widely used in Tweets, typically results in significantly degraded performance due to the lack of formal structure; the lack of sufficient context required; and...


Analysis of named entity recognition and linking for tweets

Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline,...


A Generic Open World Named Entity Disambiguation Approach for Tweets

Social media is a rich source of information. To make use of this information, it is sometimes required to extract and disambiguate named entities. In this paper, we focus on named entity disambiguation (NED) in twitter messages. NED in tweets is challenging in two ways. First, the limited length of Tweet makes it hard to have enough context while many disambiguation techniques depend on it. Th...


UNIBA: Exploiting a Distributional Semantic Model for Disambiguating and Linking Entities in Tweets

This paper describes the participation of the UNIBA team in the Named Entity rEcognition and Linking (NEEL) Challenge. We propose a knowledge-based algorithm able to recognize and link named entities in English tweets. The approach combines the simple Lesk algorithm with information coming from both a distributional semantic model and usage frequency of Wikipedia concepts. The algorithm perform...


Knowledge-based Approach for Event Extraction from Arabic Tweets

Tweets provide a continuous update on current events. However, tweets are short, personalized and noisy, which raises more challenges for event extraction and representation. Extracting events out of Arabic tweets is a new research domain where few examples – if any – of previous work can be found. This paper describes a knowledge-based approach for fostering event extraction out of Arabic tweet...




Publication date: 2014